[Chapter Sixteen][Previous]
[Next] [Art of
Assembly][Randall Hyde]
Art of Assembly: Chapter Sixteen
- 16.4 - Designing Your Own Pattern Matching
Routines
- 16.5 - Extracting Substrings from Matched
Patterns
16.4 Designing Your Own Pattern Matching Routines
Although the UCR Standard Library provides a wide variety of matching
functions, there is no way to anticipate the needs of all applications.
Therefore, you will probably discover that the library does not support
some particular pattern matching function you need. Fortunately, it is very
easy for you to create your own pattern matching functions to augment those
available in the UCR Standard Library. When you specify a matching function
name in the pattern data structure, the match routine calls the specified
address using a far call and passing the following parameters:
es:di
- Points at the next character in the input string. You
should not look at any characters before this address. Furthermore, you
should never look beyond the end of the string (see cx
below).
ds:si
- Contains the four byte parameter found in the matchparm
field.
cx
- Contains the last position, plus one, in the input string
you're allowed to look at. Note that your pattern matching routine should
not look beyond location es:cx
or the zero terminating byte;
whichever comes first in the input string.
On return from the function, ax
must contain the offset into
the string (di
's value) of the last character matched plus
one, if your matching function is successful. It must also set the carry
flag to denote success. After your pattern matches, the match routine might
call another matching function (the one specified by the next pattern field)
and that function begins matching at location es:ax
.
If the pattern match fails, then you must return the original di
value in the ax
register and return with the carry flag clear.
Note that your matching function must preserve all other registers.
There is one very important detail you must never forget with writing your
own pattern matching routines - ds
does not point at your data
segment, it contains the H.O. word of the matchparm
parameter.
Therefore, if you are going to access global variables in your data segment
you will need to push ds
, load it with the address of dseg
,
and pop ds
before leaving. Several examples throughout this
chapter demonstrate how to do this.
There are some obvious omissions from (the current version of) the UCR Standard
Library's repertoire. For example, there should probably be matchtoistr
,
matchichar
, and matchtoichar
pattern functions.
The following example code demonstrates how to add a matchtoistr
(match up to a string, doing a case insensitive comparison) routine.
.xlist
include stdlib.a
includelib stdlib.lib
matchfuncs
.list
dseg segment para public 'data'
TestString byte "This is the string 'xyz' in it",cr,lf,0
TestPat pattern {matchtoistr,xyz}
xyz byte "XYZ",0
dseg ends
cseg segment para public 'code'
assume cs:cseg, ds:dseg
; MatchToiStr- Matches all characters in a string up to, and including, the
; specified parameter string. The parameter string must be
; all upper case characters. This guy matches string using
; a case insensitive comparison.
;
; inputs:
; es:di- Source string
; ds:si- String to match
; cx- Maximum match position
;
; outputs:
; ax- Points at first character beyond the end of the
; matched string if success, contains the initial DI
; value if failure occurs.
; carry- 0 if failure, 1 if success.
MatchToiStr proc far
pushf
push di
push si
cld
; Check to see if we're already past the point were we're allowed
; to scan in the input string.
cmp di, cx
jae MTiSFailure
; If the pattern string is the empty string, always match.
cmp byte ptr ds:[si], 0
je MTSsuccess
; The following loop scans through the input string looking for
; the first character in the pattern string.
ScanLoop: push si
lodsb ;Get first char of string
dec di
FindFirst: inc di ;Move on to next (or 1st) char.
cmp di, cx ;If at cx, then we've got to
jae CantFind1st ; fail.
mov ah, es:[di] ;Get input character.
cmp ah, 'a' ;Convert input character to
jb DoCmp ; upper case if it's a lower
cmp ah, 'z' ; case character.
ja DoCmp
and ah, 5fh
DoCmp: cmp al, ah ;Compare input character against
jne FindFirst ; pattern string.
; At this point, we've located the first character in the input string
; that matches the first character of the pattern string. See if the
; strings are equal.
push di ;Save restart point.
CmpLoop: cmp di, cx ;See if we've gone beyond the
jae StrNotThere ; last position allowable.
lodsb ;Get next input character.
cmp al, 0 ;At the end of the parameter
je MTSsuccess2 ; string? If so, succeed.
inc di
mov ah, es:[di] ;Get the next input character.
cmp ah, 'a' ;Convert input character to
jb DoCmp2 ; upper case if it's a lower
cmp ah, 'z' ; case character.
ja DoCmp2
and ah, 5fh
DoCmp2: cmp al, ah ;Compare input character against
je CmpLoop
pop di
pop si
jmp ScanLoop
StrNotThere: add sp, 2 ;Remove di from stack.
CantFind1st: add sp, 2 ;Remove si from stack.
MTiSFailure: pop si
pop di
mov ax, di ;Return failure position in AX.
popf
clc ;Return failure.
ret
MTSSuccess2: add sp, 2 ;Remove DI value from stack.
MTSSuccess: add sp, 2 ;Remove SI value from stack.
mov ax, di ;Return next position in AX.
pop si
pop di
popf
stc ;Return success.
ret
MatchToiStr endp
Main proc
mov ax, dseg
mov ds, ax
mov es, ax
meminit
lesi TestString
ldxi TestPat
xor cx, cx
match
jnc NoMatch
print
byte "Matched",cr,lf,0
jmp Quit
NoMatch: print
byte "Did not match",cr,lf,0
Quit: ExitPgm
Main endp
cseg ends
sseg segment para stack 'stack'
stk db 1024 dup ("stack ")
sseg ends
zzzzzzseg segment para public 'zzzzzz'
LastBytes db 16 dup (?)
zzzzzzseg ends
end Main
16.5 Extracting Substrings from Matched Patterns
Often, simply determining that a string matches a given pattern is insufficient.
You may want to perform various operations that depend upon the actual information
in that string. However, the pattern matching facilities described thus
far do not provide a mechanism for testing individual components of the
input string. In this section, you will see how to extract portions of a
pattern for further processing.
Perhaps an example may help clarify the need to extract portions of a string.
Suppose you are writing a stock buy/sell program and you want it to process
commands described by the following regular expression:
(buy | sell) [0-9]+ shares of (ibm | apple | hp | dec)
While it is easy to devise a Standard Library pattern that recognizes strings
of this form, calling the match
routine would only tell you
that you have a legal buy or sell command. It does not tell you if you are
to buy or sell, who to buy or sell, or how many shares to buy or sell. Of
course, you could take the cross product of (buy | sell) with (ibm | apple
| hp | dec) and generate eight different regular expressions that uniquely
determine whether you're buying or selling and whose stock you're trading,
but you can't process the integer values this way (unless you willing to
have millions of regular expressions). A better solution would be to extract
substrings from the legal pattern and process these substrings after you
verify that you have a legal buy or sell command. For example, you could
extract buy or sell into one string, the digits into another, and the company
name into a third. After verifying the syntax of the command, you could
process the individual strings you've extracted. The UCR Standard Library
patgrab
routine provides this capability for you.
You normally call patgrab
after calling match
and verifying that it matches the input string. Patgrab
expects
a single parameter - a pointer to a pattern recently processed by match.
Patgrab
creates a string on the heap consisting of the characters
matched by the given pattern and returns a pointer to this string in es:di
.
Note that patgrab
only returns a string associated with a single
pattern data structure, not a chain of pattern data structures. Consider
the following pattern:
PatToGrab pattern {matchstr, str1, 0, Pat2}
Pat2 pattern {matchstr, str2}
str1 byte "Hello",0
str2 byte " there",0
Calling match
on PatToGrab
will match the string
"Hello there". However, if after calling match
you
call patgrab
and pass it the address of PatToGrab
,
patgrab
will return a pointer to the string "Hello".
Of course, you might want to collect a string that is the concatenation
of several strings matched within your pattern (i.e., a portion of the pattern
list). This is where calling the sl_match2
pattern matching
function comes in handy. Consider the following pattern:
Numbers pattern {sl_match2, FirstNumber}
FirstNumber pattern {anycset, digits, 0, OtherDigs}
OtherDigs pattern {spancset, digits}
This pattern matches the same strings as
Numbers pattern {anycset, digits, 0, OtherDigs}
OtherDigs pattern {spancset, digits}
So why bother with the extra pattern that calls sl_match2
?
Well, as it turns out the sl_match2
matching function lets
you create parenthetical patterns. A parenthetical pattern is a pattern
list that the pattern matching routines (especially patgrab
)
treat as a single pattern. Although the match
routine will
match the same strings regardless of which version of Numbers
you use, patgrab
will produce two entirely different strings
depending upon your choice of the above patterns. If you use the latter
version, patgrab
will only return the first digit of the number.
If you use the former version (with the call to sl_match2
),
then patgrab
returns the entire string matched by sl_match2
,
and that turns out to be the entire string of digits.
The following sample program demonstrates how to use parenthetical patterns
to extract the pertinent information from the stock command presented earlier.
It uses parenthetical patterns for the buy/sell command, the number of shares,
and the company name.
.xlist
include stdlib.a
includelib stdlib.lib
matchfuncs
.list
dseg segment para public 'data'
; Variables used to hold the number of shares bought/sold, a pointer to
; a string containing the buy/sell command, and a pointer to a string
; containing the company name.
Count word 0
CmdPtr dword ?
CompPtr dword ?
; Some test strings to try out:
Cmd1 byte "Buy 25 shares of apple stock",0
Cmd2 byte "Sell 50 shares of hp stock",0
Cmd3 byte "Buy 123 shares of dec stock",0
Cmd4 byte "Sell 15 shares of ibm stock",0
BadCmd0 byte "This is not a buy/sell command",0
; Patterns for the stock buy/sell command:
;
; StkCmd matches buy or sell and creates a parenthetical pattern
; that contains the string "buy" or "sell".
StkCmd pattern {sl_match2, buyPat, 0, skipspcs1}
buyPat pattern {matchistr,buystr,sellpat}
buystr byte "BUY",0
sellpat pattern {matchistr,sellstr}
sellstr byte "SELL",0
; Skip zero or more white space characters after the buy command.
skipspcs1 pattern {spancset, whitespace, 0, CountPat}
; CountPat is a parenthetical pattern that matches one or more
; digits.
CountPat pattern {sl_match2, Numbers, 0, skipspcs2}
Numbers pattern {anycset, digits, 0, RestOfNum}
RestOfNum pattern {spancset, digits}
; The following patterns match " shares of " allowing any amount
; of white space between the words.
skipspcs2 pattern {spancset, whitespace, 0, sharesPat}
sharesPat pattern {matchistr, sharesStr, 0, skipspcs3}
sharesStr byte "SHARES",0
skipspcs3 pattern {spancset, whitespace, 0, ofPat}
ofPat pattern {matchistr, ofStr, 0, skipspcs4}
ofStr byte "OF",0
skipspcs4 pattern {spancset, whitespace, 0, CompanyPat}
; The following parenthetical pattern matches a company name.
; The patgrab-available string will contain the corporate name.
CompanyPat pattern {sl_match2, ibmpat}
ibmpat pattern {matchistr, ibm, applePat}
ibm byte "IBM",0
applePat pattern {matchistr, apple, hpPat}
apple byte "APPLE",0
hpPat pattern {matchistr, hp, decPat}
hp byte "HP",0
decPat pattern {matchistr, decstr}
decstr byte "DEC",0
include stdsets.a
dseg ends
cseg segment para public 'code'
assume cs:cseg, ds:dseg
; DoBuySell- This routine processes a stock buy/sell command.
; After matching the command, it grabs the components
; of the command and outputs them as appropriate.
; This routine demonstrates how to use patgrab to
; extract substrings from a pattern string.
;
; On entry, es:di must point at the buy/sell command
; you want to process.
DoBuySell proc near
ldxi StkCmd
xor cx, cx
match
jnc NoMatch
lesi StkCmd
patgrab
mov word ptr CmdPtr, di
mov word ptr CmdPtr+2, es
lesi CountPat
patgrab
atoi ;Convert digits to integer
mov Count, ax
free ;Return storage to heap.
lesi CompanyPat
patgrab
mov word ptr CompPtr, di
mov word ptr CompPtr+2, es
printf
byte "Stock command: %^s\n"
byte "Number of shares: %d\n"
byte "Company to trade: %^s\n\n",0
dword CmdPtr, Count, CompPtr
les di, CmdPtr
free
les di, CompPtr
free
ret
NoMatch: print
byte "Illegal buy/sell command",cr,lf,0
ret
DoBuySell endp
Main proc
mov ax, dseg
mov ds, ax
mov es, ax
meminit
lesi Cmd1
call DoBuySell
lesi Cmd2
call DoBuySell
lesi Cmd3
call DoBuySell
lesi Cmd4
call DoBuySell
lesi BadCmd0
call DoBuySell
Quit: ExitPgm
Main endp
cseg ends
sseg segment para stack 'stack'
stk db 1024 dup ("stack ")
sseg ends
zzzzzzseg segment para public 'zzzzzz'
LastBytes db 16 dup (?)
zzzzzzseg ends
end Main
Sample program output:
Stock command: Buy
Number of shares: 25
Company to trade: apple
Stock command: Sell
Number of shares: 50
Company to trade: hp
Stock command: Buy
Number of shares: 123
Company to trade: dec
Stock command: Sell
Number of shares: 15
Company to trade: ibm
Illegal buy/sell command
- 16.4 - Designing Your Own Pattern
Matching Routines
- 16.5 - Extracting Substrings from Matched
Patterns
Art of Assembly: Chapter Sixteen - 29 SEP 1996
[Chapter Sixteen][Previous]
[Next] [Art of
Assembly][Randall Hyde]